This document is based on Efron and Tibshirani, An Introduction to the Bootstrap (1993), Efron and Hastie, Computer Age Statistical Inference (2021), Kosuke Imai’s (Harvard) lecture notes, and past section notes by previous TAs Aaron Rudkin and Ciara Sterbenz.

The Bootstrap


Formal treatment


Here is our population. Our \(F\) here is \(\text{Multinomial}(100, (\frac{1}{3}, \frac{1}{3}, \frac{1}{3}))\). If we knew this data generating process, we could learn everything about the population, but in practice we can’t.

population <- sample(c(1, 2, 3), 100, replace=TRUE)
population
##   [1] 3 2 3 1 3 1 2 3 3 1 2 3 3 1 1 2 3 2 2 2 2 3 1 3 1 3 3 2 2 2 1 2 3 1 3 1 2
##  [38] 1 2 3 2 2 2 3 3 2 1 3 1 2 3 3 3 2 3 1 2 3 1 2 1 3 1 1 3 1 1 3 2 1 1 3 2 1
##  [75] 2 1 3 1 1 2 1 1 1 1 1 2 3 3 2 3 1 3 1 1 3 1 2 3 1 1

Here is our data \(Y=(Y_1, \cdots, Y_{10})\). We will use it to get the empirical version of the data generating process, that is, \(\hat{F}\).

first_sample <- sample(population, 10, replace=FALSE)

Then we draw bootstrap samples from it.

bs_replicate <- sample(first_sample, 10, replace=TRUE)

We can see the analogy: Population : Sample :: Sample : Bootstrap Sample.
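The analogy can be checked directly with a statistic like the mean: the sample mean approximates the population mean, and a bootstrap replicate’s mean approximates the sample mean. A minimal sketch, rebuilding the objects above (the seed is an arbitrary choice for reproducibility):

```r
set.seed(2023) #arbitrary seed, for reproducibility
population   <- sample(c(1, 2, 3), 100, replace = TRUE)
first_sample <- sample(population, 10, replace = FALSE)
bs_replicate <- sample(first_sample, 10, replace = TRUE)
mean(population)    #the (finite-)population mean
mean(first_sample)  #estimates the population mean
mean(bs_replicate)  #estimates the sample mean
```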

Why does bootstrapping work?

complicated_data <- tibble(value = c(rnorm(100, 5, 1.5), #a messy mixture: two normals,
                                     rnorm(35, 10, 3),   #a uniform chunk,
                                     runif(20, 0, 2.5),  #and a few outliers
                                     -1, -0.9, -1, -1, 1, 20))
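For a messy mixture like this there is no textbook formula for, say, the standard error of the median, yet the bootstrap recipe is unchanged. A minimal sketch (the vector restates the mixture above; `B = 1000` is an arbitrary choice):

```r
set.seed(2023) #arbitrary seed, for reproducibility
#the messy mixture from above, as a plain vector
vals <- c(rnorm(100, 5, 1.5), rnorm(35, 10, 3), runif(20, 0, 2.5),
          -1, -0.9, -1, -1, 1, 20)
#no analytic SE formula for the median here, but the bootstrap does not care:
boot_medians <- replicate(1000, median(sample(vals, length(vals), replace = TRUE)))
sd(boot_medians) #bootstrap standard error of the sample median
```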

How to implement bootstrap

  1. Take a sample of size \(n\) from our sample with replacement. (This means that each unit from your sample might appear in the resample zero times, one time, or many times.)
  2. Estimate our estimator on each bootstrap sample.
  3. Repeat steps 1 and 2 many times (\(B\) times), forming a sampling distribution for our estimator.
  4. Use the sampling distribution of the estimator (bootstrap distribution) to conduct inference: e.g., calculate the standard error and get a percentile confidence interval.
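The four steps above can be sketched generically before we turn to real data (everything here is illustrative: `my_sample` is a stand-in for any observed sample, and `B = 1000` is arbitrary):

```r
set.seed(2023)                            #arbitrary seed, for reproducibility
my_sample <- rnorm(100, mean = 5, sd = 2) #stand-in for any observed sample
B <- 1000
boot_est <- replicate(B, {
  resample <- sample(my_sample, length(my_sample), replace = TRUE) #step 1
  mean(resample)                                                   #step 2
})                                        #step 3: B replicates -> bootstrap distribution
sd(boot_est)                              #step 4: bootstrap standard error
quantile(boot_est, c(0.025, 0.975))       #step 4: 95% percentile CI
```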



Examples of Bootstrap with R

1. Nonparametric Bootstrap

  • Let’s look at some real data: this is from the U.S. National Health Interview Survey, and we’re going to treat it as our population (\(N = 4,347\)). Let’s use the non-parametric bootstrap (resampling the data without any distributional/model assumption on the DGP) to assess the standard errors of the sample mean and sample median of height.
library(tidyverse)
#data
health_data <- read.csv("height_data.csv")
health_data <- health_data %>% filter(height <= 76)
nrow(health_data) #we'll treat it as population data with n=4,347
## [1] 4347
#sample estimates of mean and median height
sample_data <- sample_n(health_data, size = 100) #draw our sample n=100
observed_mean <- mean(sample_data$height)
observed_median <- median(sample_data$height)

#let's make a function for obtaining bootstrap sampling distribution of mean & median
boot_samp_dist <- function(data, B){
  replicate(B, { #B: number of reps
    #1. get bootstrap sample (note: resample indices, not data themselves for efficient computation)
    indices <- sample(1:nrow(data), nrow(data), replace=TRUE) 
    boot_samp <- data[indices, "height"] #using row indices
    #2. estimate our statistics (mean/median) on each bootstrap sample and collect them
    boot_mean <- mean(boot_samp)
    boot_median <- median(boot_samp)
    c(boot_mean = boot_mean, boot_median = boot_median) #return them as a vector
  }, simplify = FALSE) %>% bind_rows() #collect to form a sampling distribution
}

#3. repeat steps 1 and 2 `B` times -> bootstrap sampling distribution
boot_dist <- boot_samp_dist(data=sample_data, B=500) #B=500

Here is an animated plot showing the construction of the bootstrap sampling distribution for each step.

Again, the standard error of an estimator is the standard deviation of the sampling distribution.

##4. Inference for sample mean
#standard error
se_mean_boot <- sd(boot_dist$boot_mean)
se_mean_boot
## [1] 0.4131497

A confidence interval for \(\hat{\theta}\) can be constructed using the standard normal confidence interval procedure. Another (purely nonparametric) approach is the percentile interval: take the quantiles of the bootstrap sampling distribution of the \(\hat{\theta}^*_b\). For example, a 95% percentile confidence interval is \[\hat{\theta}^*_{(lower)} < \theta < \hat{\theta}^*_{(upper)},\] where the \(\hat{\theta}^*_{(b)}\) are the \(B\) ordered bootstrap replicates, with \(lower = .025 \cdot B\) and \(upper = .975 \cdot B\). (There are some approaches to improve bootstrap intervals: for example, the “bias-corrected, accelerated” or “BCa” interval adjusts for bias and skewness in the bootstrap distribution; see the Efron and Tibshirani book.)

#confidence intervals using percentile method
quantile(boot_dist$boot_mean, c(0.05/2, 1-0.05/2))
##     2.5%    97.5% 
## 66.04375 67.71525
#CIs via normal approximation
mean(boot_dist$boot_mean) + c(qnorm(0.025), qnorm(0.975))*se_mean_boot
## [1] 66.06494 67.68446

Ok, let’s compare bootstrap standard error and percentile CIs to analytic solutions (\(se(\bar{X}) = \sqrt{\frac{Var(X)}{n}}\)).

#standard error of sample mean
se_est <- sqrt(var(sample_data$height)/nrow(sample_data))
se_est
## [1] 0.4254279
#t CIs of sample mean
observed_mean + c(-qt(1-.05/2, df=nrow(sample_data)-1), qt(1-.05/2, df=nrow(sample_data)-1))*se_est
## [1] 66.04586 67.73414

Here is a more general bootstrap function, in base R style.

#Source: Efron and Hastie (2021) `Algorithm 10.1` with modification
Boot <- function (x, B, func, ...){
  # x is data vector or matrix (with each row a case)
  # B is number of bootstrap replications
  # func is an R function that takes a data vector or matrix and returns a numeric scalar or vector
  # ... other arguments for func
  x <- as.matrix(x)
  n <- nrow(x)
  f0 <- func(x, ...) #call func once to get the size of the would-be estimate
  fmat <- matrix(0, length(f0), B)
  for (b in 1:B) {
    i <- sample(1:n, n, replace = TRUE) #resample indices
    fmat[, b] <- func(x[i, ], ...)
  }
  drop(fmat) 
}

sd(Boot(x=sample_data$height, B=500, func = mean)) #standard error of sample mean
## [1] 0.4166802
sd(Boot(x=sample_data$height, B=500, func = median)) #standard error of sample median
## [1] 0.944179
#Boot function is flexible in that we can feed our custom function
#say our weird estimator is median of weight/height ratio
estimate_function <- function(data){
  median(data[, 1]/data[, 2])  
}
sd(Boot(x=cbind(sample_data$weight, sample_data$height), B=500, func=estimate_function))
## [1] 0.0677967

The boot function in the boot package is good when you have a simple bootstrap and a clear function that returns the quantity of interest. At the moment, it’s more important to understand how the bootstrap works line by line, so I’ll skip it. We’ll use the package later in the course whenever we need the bootstrap.


2. Parametric Bootstrap

  • Suppose we are confident that height is normally distributed. We could use the parametric bootstrap: we use the sample mean and sample sd estimated from the data as our parameter values. That is, we now draw directly from the assumed parametric model fitted to the sample, \(F_{\color{red}{\hat{\theta}}} =N( \color{red}{\hat{\mu}}, \color{red}{\hat{\sigma}})\), rather than from the observations themselves (or their empirical CDF). If our model represents the true data generating process sufficiently well, the parametric bootstrap is valid.

  • The height example continues. We have height data \[Y_i \overset{iid}{\sim} N(\mu, \sigma),\] for \(i=1, \cdots, n\), and we set \(n=100\). The observed estimate of the mean is \(\hat{\theta}=g(Y)=\frac{\sum_i Y_i}{n}\). We plug in our estimates for the parameters of the normal distribution, \(\hat{\mu}=\bar{Y}\) and \(\hat{\sigma}=s\), where \(\bar{Y}\) is the sample mean and \(s\) is the sample sd. Then a parametric bootstrap sample is \[Y_i^* \overset{iid}{\sim} N(\bar{Y}, s),\] for \(i=1, \cdots, n\), and \(\hat{\theta}^*_b = g(Y^*)=\frac{\sum_i Y_i^*}{n}\) is a bootstrap sample mean computed from the \(Y_i^*\). \(B\) bootstrap samples give the sampling distribution of \(\hat{\theta}^*_b\).

## parametric bootstrap with Normal model
#parameter estimates (MLE)
mu_hat <- mean(sample_data$height) #plug-in sample mean for mu_hat
sigma_hat <- sd(sample_data$height) #sample sd for sigma_hat

#get parametric bootstrap samples: just plug-in observed parameter values and draw B times
par_boot_samples <- replicate(500, { #B=500, same as nonparametric case for illustration
    mean(rnorm(100, mean = mu_hat, sd = sigma_hat)) #n=100, same size as sample data
  }, simplify = TRUE) #bootstrap dist of sample mean, output as a vector

sd(par_boot_samples) #parametric bootstrap standard error
## [1] 0.4168464


3. Block (Clustered) Bootstrap: when i.i.d. is violated at unit level

  • The block bootstrap is used to handle clustering: we have a complicated sampling structure that samples larger units (e.g., classrooms, schools, districts, villages, households) i.i.d., but within a cluster the individual draws are not i.i.d. For example, we sample districts in California i.i.d., then within each district we collect the test scores of all 6th-grade students (not i.i.d.!) and take the mean.

  • If you have experimental data with blocked assignment, or your data contain a clustered structure, you should resample the clusters (not units!) with replacement. This method is called the block/cluster bootstrap, which is exactly like the bootstrap except that we resample whole clusters rather than individual units. To conduct the (non-parametric) block bootstrap:

  1. Take a with-replacement sample of \(\#\{ \text{clusters} \}\) clusters from our sample, keeping each cluster intact.
  2. Calculate our would-be estimate using the observations collected in the block bootstrap sample. (The total number of observations may differ across replicates when cluster sizes differ.)
  3. Repeat steps 1 and 2 many times, forming a sampling distribution for our estimator.
  4. Use the sampling distribution of the estimator (block bootstrap distribution) to conduct inference: calculate the standard error, get a percentile confidence interval, etc.
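Before the full regression example, here is the recipe on a toy grouped dataset (everything below — the data frame, cluster ids, and `B = 1000` — is made up for illustration): resample cluster ids, stack the intact clusters, then compute the statistic.

```r
set.seed(2023) #arbitrary seed; all objects below are illustrative
toy <- data.frame(cluster = rep(1:5, each = 4),
                  y = rnorm(20, mean = rep(c(0, 1, 2, 3, 4), each = 4)))
boot_means <- replicate(1000, {
  ids <- sample(unique(toy$cluster), replace = TRUE) #step 1: resample whole clusters
  boot_data <- do.call(rbind, lapply(ids, function(k) toy[toy$cluster == k, ]))
  mean(boot_data$y)                                  #step 2: statistic on the stacked clusters
})                                                   #step 3: B replicates
sd(boot_means)                                       #step 4: cluster-bootstrap SE
```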


Let’s take an example. Does school enrollment impact academic performance? Let’s regress the academic performance index in 2000 (api00) on the total enrollment at a school (enroll), with two covariates, the percentage of fully qualified teachers (full) and the percentage of students eligible for subsidized meals (meals): \[y_i = \beta_0 + \beta_1 \cdot enroll_i + \beta_2 \cdot full_i + \beta_3 \cdot meals_i + \epsilon_i\] The regression coefficient \(\beta_1 = \beta_{\text{enroll}}\) is our quantity of interest.

Note that the apiclus1 dataset was made by cluster sampling: a random sample of 15 entire districts (rather than schools), where each observation is an individual school. We might think that there are district-specific factors that lead to distinct academic performance, and hence the schools within the same district are not independent.

How to get the standard error for something this complicated? Just replicate the exact sampling design and estimation in a bootstrap!


library(survey) #to get api data
library(stargazer)
data(api)

#15 districts
unique(apiclus1$dnum)
##  [1] 637 437 778 197 406 815 178 255 568 135 510 716  61 413 448
#make a function for block/cluster bootstrap
block_boot_lm <- function(B, data, cluster, formula){
  boots <- replicate(B, {
    #1. get sample of CLUSTERS (not units) with replacement
    clusters <- sample(unique(data[, cluster]), replace=TRUE)
    #for each sampled cluster, build up a new dataset -> bootstrap data
    boot_data <- map_df(clusters, function(x){
      data[data[, cluster] == x, ] #`data[, cluster] == x` is a logical vector selecting the rows in cluster x
    })
    #2. estimate regression coefficients on each bootstrap sample and collect them
    boot_lm <- lm(formula, data = boot_data) #one bootstrap statistic/estimate
    return(as_tibble( t(coef(boot_lm)))) #transpose to make it row vector, to stack row-wise
  }, simplify = FALSE) %>% bind_rows() #bootstrap distribution
  return(boots)
}

#run a regression on original data: "observed" estimates
lm_res <- lm(api00 ~  enroll+meals+full, data=apiclus1)
#obtain block bootstrap distribution
block_boot_lm_res <- block_boot_lm(1000, apiclus1, "dnum", api00 ~ enroll+meals+full) #B=1,000

#compare to nominal standard error
block_boot_se <- apply(block_boot_lm_res, 2, sd) #compute standard deviation for each coefficient
names(block_boot_se) = names(lm_res$coefficients)
stargazer(lm_res, lm_res, type="text",
          title="Regression with and without Clustered SEs",
          se = list(sqrt(diag(vcov(lm_res))), block_boot_se),
          dep.var.labels="Academic Performance Index in 2000",
          column.labels = c("Nominal", "Cluster-Bootstrap"),
          omit.stat=c("LL", "ser", "f"), no.space=TRUE)
## 
## Regression with and without Clustered SEs
## ================================================
##                      Dependent variable:        
##              -----------------------------------
##              Academic Performance Index in 2000 
##                   Nominal      Cluster-Bootstrap
##                     (1)               (2)       
## ------------------------------------------------
## enroll           -0.070***         -0.070***    
##                   (0.010)           (0.012)     
## meals            -3.281***         -3.281***    
##                   (0.145)           (0.283)     
## full             1.383***          1.383***     
##                   (0.351)           (0.529)     
## Constant        727.270***        727.270***    
##                  (34.653)          (51.891)     
## ------------------------------------------------
## Observations        183               183       
## R2                 0.791             0.791      
## Adjusted R2        0.787             0.787      
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

Finally, here is our block bootstrap sampling distribution of the regression coefficient on the enroll variable, compared to the simple bootstrap method.

##Comparison of sampling distibution
#simple bootstrap for lm
boot_lm <- function(B, data, formula){
  boots <- replicate(B, { #B: number of reps
    #1. get bootstrap sample (note: resample indices, not data themselves for efficient computation)
    indices <- sample(1:nrow(data), nrow(data), replace=TRUE) 
    boot_data <- data[indices, ] #using row indices
    #2. estimate reg coef on each bootstrap sample and collect them
    boot_lm <- lm(formula, data = boot_data) #one bootstrap statistic/estimate
    return(as_tibble( t(coef(boot_lm)))) #transpose to make it row vector, to stack row-wise
  }, simplify = FALSE) %>% bind_rows()
  return(boots)
}
boot_lm_res <- boot_lm(1000, apiclus1, api00 ~ enroll+meals+full) #B=1,000
apply(block_boot_lm_res, 2, sd) #block bootstrap SEs; compare with apply(boot_lm_res, 2, sd)
## (Intercept)      enroll       meals        full 
## 51.89050362  0.01235661  0.28283590  0.52910802

Quick review quiz

Q. Say we have a bootstrapped sampling distribution of mean estimates from our bootstrap samples. How do we calculate the bootstrap SE estimate? And how about the usual (analytic) SE estimate?

  1. sd(boot_sampling_distribution)/sqrt(nrow(sample_data))
  2. sd(boot_sampling_distribution)/B
  3. sd(boot_sampling_distribution)
  4. sd(sample_data)
  5. sd(sample_data)/sqrt(nrow(sample_data))
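You can check your answer numerically: for the sample mean, the bootstrap SE and the usual analytic SE should roughly agree. A sketch with simulated data (the seed, sample, and `B = 2000` are arbitrary choices):

```r
set.seed(2023) #arbitrary seed; simulated data for illustration
x <- rnorm(200, mean = 10, sd = 3)
boot_means <- replicate(2000, mean(sample(x, length(x), replace = TRUE)))
sd(boot_means)          #bootstrap SE of the mean
sd(x) / sqrt(length(x)) #usual analytic SE: should be close to the line above
```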

Summary